{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Missing values\n",
    "\n",
    "Missing data, or Missing values, occur when __no data__ / __no value__ is stored for a certain observation within a variable. \n",
    "\n",
    "Missing data are a common occurrence both in data science competitions and in data in business settings, and can have a significant effect on the conclusions that can be drawn from the data. Incomplete data is an unavoidable problem in dealing with most data sources.\n",
    "\n",
    "### Why is data missing?\n",
    "\n",
    "The source of missing data can be very different and here are just a few examples:\n",
    "\n",
    "- A value is missing because it was forgotten or lost or not stored properly\n",
    "- For a certain observation, the value of the variable does not exist\n",
    "- The value can't be known or identified\n",
    "\n",
    "Imagine for example that the data comes from a survey, and the data are entered manually into an online form. The data entry could easily forget to complete a field in the form, and therefore, that value for that form would be missing.\n",
    "\n",
    "The person being asked may not want to disclose the answer to one of the questions, for example, their income. That would be then a missing value for that person.\n",
    "\n",
    "Sometimes, a certain feature can't be calculated for a specific individual. For example, in the variable 'total debt as percentage of total income' if the person has no income, then the total percentage of 0 does not exist. Therefore it will be a missing value.\n",
    "\n",
    "Together with understanding the source of missing data, it is important to understand the mechanisms by which missing fields are introduced in a dataset. Depending on the mechanism, we may choose to process the missing values differently. In addition, by knowing the source of missing data, we may choose to take action to control that source, and decrease the number of missing data looking forward during data collection.\n",
    "\n",
    "\n",
    "### Missing Data Mechanisms\n",
    "\n",
    "There are 3 mechanisms that lead to missing data, 2 of them involve missing data randomly or almost-randomly, and the third one involves a systematic loss of data.\n",
    "\n",
    "#### Missing Completely at Random, MCAR:\n",
    "\n",
    "A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. \n",
    "When data is MCAR, there is absolutely no relationship between the data missing and any other values, observed or missing, within the dataset. In other words, those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than other.\n",
    "\n",
    "If values for observations are missing completely at random, then disregarding those cases would not bias the inferences made.\n",
    "\n",
    "\n",
    "#### Missing at Random, MAR: \n",
    "\n",
    "MAR occurs when there is a systematic relationship between the propensity of missing values and the observed data. In other words, the probability an observation being missing depends only on available information (other variables in the dataset). For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for those men and women that decided not to disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.\n",
    "\n",
    "In a situation like the above, if we decide to proceed with the variable with missing values (in this case weight), we might benefit from including gender to control the bias in weight for the missing observations.\n",
    "\n",
    "#### Missing Not at Random, MNAR: \n",
    "\n",
    "Missing of values is not at random (MNAR) if their being missing depends on information not recorded in the dataset. In other words, there is a mechanism or a reason why missing values are introduced in the dataset.\n",
    "\n",
    "Examples:\n",
    "\n",
    "MNAR would occur if people failed to fill in a depression survey because of their level of depression. Here, the missing of data is related to the outcome, depression.\n",
    "\n",
    "When a financial company asks for bank and identity documents from customers in order to prevent identity fraud, typically, fraudsters impersonating someone else will not upload documents, because they don't have them, precisely because they are fraudsters. Therefore, there is a systematic relationship between the missing documents and the target we want to predict: fraud.\n",
    "\n",
    "Understanding the mechanism by which data can be missing is important to decide which methods to use to handle the missing values. I will cover how to handle missing values in detail sections 5 and 6.\n",
    "\n",
    "\n",
    "\n",
    "## Real Life example: \n",
    "\n",
    "### Predicting Survival on the Titanic: understanding society behaviour and beliefs\n",
    "\n",
    "Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. Therefore, we can learn about the society priorities and privileges at the time.\n",
    "\n",
    "### Peer to peer lending: Finance\n",
    "\n",
    "Lending Club is a peer-to-peer Lending company based in the US. They match people looking to invest money with people looking to borrow money. When investors invest their money through Lending Club, this money is passed onto borrowers, and when borrowers pay their loans back, the capital plus the interest passes on back to the investors. It is a win for everybody as they can get typically lower loan rates and higher investor returns.\n",
    "\n",
    "If you want to learn more about Lending Club follow this [link](https://www.lendingclub.com/)\n",
    "\n",
    "The Lending Club dataset contains complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. Features (aka variables) include credit scores, number of finance inquiries, address including zip codes and state, and collections among others. Collections indicates whether the customer has missed one or more payments and the team is trying to recover their money. The file is a matrix of about 890 thousand observations and 75 variables. More detail on this dataset can be found in [Kaggle's website](https://www.kaggle.com/wendykan/lending-club-loan-data)\n",
    "\n",
    "====================================================================================================\n",
    "\n",
    "To download the Titanic data, go ahead to the [Kaggle website](https://www.kaggle.com/c/titanic/data)\n",
    "\n",
    "Click on the link 'train.csv', and then click the 'download' blue button towards the right of the screen, to download the dataset. Save it in a folder of your choice.\n",
    "\n",
    "For the Lending Club dataset. go to this website:\n",
    "Go ahead to this [website](https://www.kaggle.com/wendykan/lending-club-loan-data)\n",
    "\n",
    "Scroll down to the bottom of the page, and click on the link 'loan.csv', and then click the 'download' blue button towards the right of the screen, to download the dataset. Unzip it, and save it to a directory of your choice.\n",
    "\n",
    "**Note that you need to be logged in to Kaggle in order to download the datasets**.\n",
    "\n",
    "If you save it in the same directory from which you are running this notebook, and you rename the file to 'titanic.csv' then you can load it the same way I will load it below.\n",
    "\n",
    "===================================================================================================="
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# to display the total number columns present in the dataset\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17599</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>C85</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>female</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>STON/O2. 3101282</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>female</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>113803</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>C123</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Allen, Mr. William Henry</td>\n",
       "      <td>male</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>373450</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  \\\n",
       "0            1         0       3   \n",
       "1            2         1       1   \n",
       "2            3         1       3   \n",
       "3            4         1       1   \n",
       "4            5         0       3   \n",
       "\n",
       "                                                Name     Sex   Age  SibSp  \\\n",
       "0                            Braund, Mr. Owen Harris    male  22.0      1   \n",
       "1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   \n",
       "2                             Heikkinen, Miss. Laina  female  26.0      0   \n",
       "3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   \n",
       "4                           Allen, Mr. William Henry    male  35.0      0   \n",
       "\n",
       "   Parch            Ticket     Fare Cabin Embarked  \n",
       "0      0         A/5 21171   7.2500   NaN        S  \n",
       "1      0          PC 17599  71.2833   C85        C  \n",
       "2      0  STON/O2. 3101282   7.9250   NaN        S  \n",
       "3      0            113803  53.1000  C123        S  \n",
       "4      0            373450   8.0500   NaN        S  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's load the titanic dataset\n",
    "\n",
    "data = pd.read_csv('titanic.csv')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In python, the missing values are stored as NaN, see for example the first row for the variable Cabin."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PassengerId      0\n",
       "Survived         0\n",
       "Pclass           0\n",
       "Name             0\n",
       "Sex              0\n",
       "Age            177\n",
       "SibSp            0\n",
       "Parch            0\n",
       "Ticket           0\n",
       "Fare             0\n",
       "Cabin          687\n",
       "Embarked         2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# you can determine the total number of missing values using\n",
    "# the isnull method plus the sum method on the dataframe\n",
    "\n",
    "data.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PassengerId    0.000000\n",
       "Survived       0.000000\n",
       "Pclass         0.000000\n",
       "Name           0.000000\n",
       "Sex            0.000000\n",
       "Age            0.198653\n",
       "SibSp          0.000000\n",
       "Parch          0.000000\n",
       "Ticket         0.000000\n",
       "Fare           0.000000\n",
       "Cabin          0.771044\n",
       "Embarked       0.002245\n",
       "dtype: float64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# alternatively, you can call the mean method after isnull\n",
    "# to visualise the percentage of the dataset that \n",
    "# contains missing values for each variable\n",
    "\n",
    "data.isnull().mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are missing data in the variables Age, Cabin (in which the passenger was travelling) and Embarked, which is the port from which the passenger got into the Titanic."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Missing data Not At Random (MNAR): Systematic missing values\n",
    "\n",
    "In this dataset, both the missing values of the variables Cabin and Age, were introduced systematically. For many of the people who did not survive, the **age** they had or the **cabin** they were staying in, could not be established. The people who survived could be asked for that information.\n",
    "\n",
    "Can we infer this by looking at the data?\n",
    "\n",
    "In a situation like this, we could expect a greater number of missing values for people who did not survive.\n",
    "\n",
    "Let's have a look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.7710437710437711"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we create a dummy variable that indicates whether the value\n",
    "# of the variable cabin is missing\n",
    "\n",
    "data['cabin_null'] = np.where(data.Cabin.isnull(), 1, 0)\n",
    "\n",
    "# find percentage of null values\n",
    "data.cabin_null.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, this value coincides with the one observed above when we called the .isnull().mean() method on the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Survived\n",
       "0    0.876138\n",
       "1    0.602339\n",
       "Name: cabin_null, dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# and then we evaluate the mean of the missing values in\n",
    "# cabin for the people who survived vs the non-survivors.\n",
    "\n",
    "# group data by Survived vs Non-Survived\n",
    "# and find nulls for cabin\n",
    "data.groupby(['Survived'])['cabin_null'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that the percentage of missing values is higher for people who did not survive (0.87), respect to people that survived (0.60).\n",
    "This finding is aligned with our hypothesis that the data is missing because after the people died, the information could not be retrieved.\n",
    "\n",
    "Having said this, to truly underpin whether the data is missing not at random, we would need to get extremely familiar with the way data was collected. Analysing datasets, can only point us in the right direction or help us build assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Survived\n",
       "0    0.227687\n",
       "1    0.152047\n",
       "Name: age_null, dtype: float64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we repeat the exercise for the variable age:\n",
    "# First we create a dummy variable that indicates\n",
    "# whether the value of the variable Age is missing\n",
    "\n",
    "data['age_null'] = np.where(data.Age.isnull(), 1, 0)\n",
    "\n",
    "# and then look at the mean in the different survival groups:\n",
    "# there are more NaN for the people who did not survive\n",
    "data.groupby(['Survived'])['age_null'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, we observe an increase in missing data for the people who did not survive the tragedy. The analysis therefore suggests: \n",
    "\n",
    "**There is a systematic loss of data: people who did not survive tend to have more information missing. Presumably, the method chosen to gather the information, contributes to the generation of these missing data.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Missing data Completely At Random (MCAR)\n",
    "\n",
    "In the titanic dataset, there were also missing values for the variable Embarked, let's have a look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "      <th>cabin_null</th>\n",
       "      <th>age_null</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>62</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Icard, Miss. Amelie</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>113572</td>\n",
       "      <td>80.0</td>\n",
       "      <td>B28</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>829</th>\n",
       "      <td>830</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Stone, Mrs. George Nelson (Martha Evelyn)</td>\n",
       "      <td>female</td>\n",
       "      <td>62.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>113572</td>\n",
       "      <td>80.0</td>\n",
       "      <td>B28</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     PassengerId  Survived  Pclass                                       Name  \\\n",
       "61            62         1       1                        Icard, Miss. Amelie   \n",
       "829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)   \n",
       "\n",
       "        Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked  cabin_null  \\\n",
       "61   female  38.0      0      0  113572  80.0   B28      NaN           0   \n",
       "829  female  62.0      0      0  113572  80.0   B28      NaN           0   \n",
       "\n",
       "     age_null  \n",
       "61          0  \n",
       "829         0  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# slice the dataframe to show only those observations\n",
    "# with missing value for Embarked\n",
    "\n",
    "data[data.Embarked.isnull()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These 2 women were travelling together, Miss Icard was the maid of Mrs Stone.\n",
    "\n",
    "A priori, there does not seem to be an indication that the missing information in the variable Embarked is depending on any other variable, and the fact that these women survived, means that they could have been asked for this information.\n",
    "\n",
    "Very likely this missingness was generated at the time of building the dataset and therefore we could assume that it is completely random. We can assume that the probability of data being missing for these 2 women is the same as the probability for this variable to be missing for any other person. Of course this will be hard, if possible at all, to prove."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Missing data At Random (MAR)\n",
    "\n",
    "For this example, I will use the Lending Club loan book. I will look specifically at the variables employer name (emp_title) and years in employment (emp_length), declared by the borrowers at the time of applying for a loan. The former refers to the name of the company for which the borrower works, the second one to how many years the borrower has worked for named company.\n",
    "\n",
    "Here I will show an example, in which a data point missing in one variable (emp_title) depends on the value entered on the other variable (emp_lenght)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>emp_title</th>\n",
       "      <th>emp_length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "      <td>10+ years</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Ryder</td>\n",
       "      <td>&lt; 1 year</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>10+ years</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AIR RESOURCES BOARD</td>\n",
       "      <td>10+ years</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>University Medical Group</td>\n",
       "      <td>1 year</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  emp_title emp_length\n",
       "0                       NaN  10+ years\n",
       "1                     Ryder   < 1 year\n",
       "2                       NaN  10+ years\n",
       "3       AIR RESOURCES BOARD  10+ years\n",
       "4  University Medical Group     1 year"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's load the columns of interest from the Lending Club loan book dataset\n",
    "\n",
    "data=pd.read_csv('loan.csv', usecols=['emp_title','emp_length'])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "emp_title     0.057988\n",
       "emp_length    0.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's check the amount of missing data\n",
    "data.isnull().mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Around 6% of the observations contain missing data for employment title."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of different employer names: 299273\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([nan, 'Ryder', 'AIR RESOURCES BOARD', 'University Medical Group',\n",
       "       'Veolia Transportaton', 'Southern Star Photography',\n",
       "       'MKC Accounting ', 'Starbucks', 'Southwest Rural metro', 'UCLA',\n",
       "       'Va. Dept of Conservation/Recreation', 'Target', 'SFMTA',\n",
       "       'Internal revenue Service', \"Chin's Restaurant\", 'Duracell',\n",
       "       'Connection Inspection', 'Network Interpreting Service',\n",
       "       'Archdiocese of Galveston Houston', 'Osram Sylvania'], dtype=object)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's peek at the different employer names\n",
    "\n",
    "print('Number of different employer names: {}'.format(len(data.emp_title.unique())))\n",
    "data.emp_title.unique()[0:20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['10+ years', '< 1 year', '1 year', '3 years', '8 years', '9 years',\n",
       "       '4 years', '5 years', '6 years', '2 years', '7 years', 'n/a'], dtype=object)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's inspect the variable emp_length\n",
    "data.emp_length.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10+ years    0.328573\n",
       "2 years      0.088880\n",
       "< 1 year     0.079566\n",
       "3 years      0.078913\n",
       "1 year       0.064341\n",
       "5 years      0.062774\n",
       "4 years      0.059196\n",
       "n/a          0.050514\n",
       "7 years      0.050254\n",
       "8 years      0.049534\n",
       "6 years      0.048401\n",
       "9 years      0.039055\n",
       "Name: emp_length, dtype: float64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's look at the percentage of borrowers within\n",
    "# each label / category of the emp_length variable\n",
    "\n",
    "data.emp_length.value_counts() / len(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The label 'n/a', which in virtue could be similar to NaN, represents also around 5% of the dataset. So there could be a relationship between missing values in emp_title and 'n/a' in emp_length. Let's have a look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'1 year': '0-10 years',\n",
       " '10+ years': '10+ years',\n",
       " '2 years': '0-10 years',\n",
       " '3 years': '0-10 years',\n",
       " '4 years': '0-10 years',\n",
       " '5 years': '0-10 years',\n",
       " '6 years': '0-10 years',\n",
       " '7 years': '0-10 years',\n",
       " '8 years': '0-10 years',\n",
       " '9 years': '0-10 years',\n",
       " '< 1 year': '0-10 years',\n",
       " 'n/a': 'n/a'}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the variable emp_length has many categories. I will summarise it\n",
    "# into 3 for simplicity:'0-10 years' or '10+ years' or 'n/a'\n",
    "\n",
    "# let's build a dictionary to re-map emp_length to just 3 categories:\n",
    "\n",
    "length_dict = {k:'0-10 years' for k in data.emp_length.unique()}\n",
    "length_dict['10+ years']='10+ years'\n",
    "length_dict['n/a']='n/a'\n",
    "\n",
    "# let's look at the dictionary\n",
    "length_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['10+ years', '0-10 years', 'n/a'], dtype=object)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's re-map the emp_length\n",
    "\n",
    "data['emp_length_redefined'] = data.emp_length.map(length_dict)\n",
    "data.emp_length_redefined.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "emp_length_redefined\n",
       "10+ years     0.054006\n",
       "0-10 years    0.082457\n",
       "n/a           0.863537\n",
       "Name: emp_length, dtype: float64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's calculate the proportion of working years\n",
    "# with same employer for those who miss data on employer name\n",
    "\n",
    "# number of borrowers for whom employer name is missing\n",
    "value = len(data[data.emp_title.isnull()])\n",
    "\n",
    "# % of borrowers for whom employer name is missing \n",
    "# within each category of employment length\n",
    "data[data.emp_title.isnull()].groupby(['emp_length_redefined'])['emp_length'].count().sort_values() / value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The majority of the missing values in the job title (emp_title) supplied by the borrower coincides with the label n/a of employment length. This supports the idea that the 2 variables are related."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "emp_length_redefined\n",
       "n/a           0.000467\n",
       "10+ years     0.345475\n",
       "0-10 years    0.654059\n",
       "Name: emp_length, dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's do the same for those bororwers who reported\n",
    "# the employer name\n",
    "\n",
    "# number of borrowers for whom employer name is present\n",
    "value = len(data.dropna(subset=['emp_title']))\n",
    "\n",
    "# % of borrowers within each category\n",
    "data.dropna(subset=['emp_title']).groupby(['emp_length_redefined'])['emp_length'].count().sort_values() / value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of borrowers who have reported an employer name and indicate n/a as employment length are minimal. Further supporting the idea that the missing values in employment length and employment length are related.\n",
    "\n",
    "'n/a' in 'employment length' could be supplied by people who are retired, or students, or self-employed. In all of those cases there would not be a number of years at employment to provide, therefore the customer would enter 'n/a' and leave empty the form at the side of 'employer_name'.\n",
    "\n",
    "In a scenario like this, a missing value in the variable emp_title depends on or is related to the 'n/a' label in the variable emp_length. And, this missing value nature is, in principle, independent of the variable we want to predict (in this case whether the borrower will repay their loan). How this will affect the predictions is unknown."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": true
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
